fix(storage-opendal): add TimeoutLayer inside RetryLayer to bound hangs#2455
fix(storage-opendal): add TimeoutLayer inside RetryLayer to bound hangs#2455AlJohri wants to merge 1 commit into
Conversation
The current `create_operator` applies `RetryLayer` with no `TimeoutLayer`. Per opendal's docs, `TimeoutLayer` must be inside `RetryLayer`; without it, a future parked indefinitely on a silently dropped TCP connection never produces an `Err` and `RetryLayer` cannot retry — the caller hangs forever. We hit this in production: an iceberg `try_next()` Range-GET parked for 24h until a Kubernetes Job's `activeDeadlineSeconds` killed the pod. Core-dump analysis confirmed two in-flight Range-GETs sitting in heap with no live TCP connection, no error propagated, no retries attempted. This adds `TimeoutLayer::new()` with opendal's defaults (60s for non-IO ops; 10s per IO chunk) inside `RetryLayer`. Each retry attempt is now independently bounded; a hung connection times out, RetryLayer sees the error, and either succeeds on a subsequent attempt or propagates a clean error to the caller. Refs: apache#1288 (auto-closed by stale bot without engagement), apache#788 (the PR that added the unbounded RetryLayer).
Kurtiscwright
left a comment
There was a problem hiding this comment.
Looks fairly straight forward, just a little confused how this maps to the values contained in the timeout table.
| // Transient errors are common for object stores; we retry temporary | ||
| // failures with exponential backoff. The retry behavior also | ||
| // benefits non-object-store backends. | ||
| let operator = operator.layer(TimeoutLayer::new()).layer(RetryLayer::new()); |
There was a problem hiding this comment.
The original feature request linked to a Table with both min and max timeouts, how does this get encoded here? https://iceberg.apache.org/docs/latest/configuration/#table-behavior-properties
There was a problem hiding this comment.
In this PR I haven't made it configurable at all and just used the OpenDAL defaults but I can definitely update the PR to make it configurable if that's desirable.
I was originally thinking it may be worth just adding the TimeoutLayer in place with the existing defaults similar to the RetryLayer and doing configurability changes in a follow up, but happy to do it now as well.
I didn't see any knobs for configuring the object level retries and timeouts in the table behavior properties link, but looking at other implementations, I could add:
s3.connect-timeout(matches pyiceberg and pyarrow)s3.request-timeout(matches pyiceberg and pyarrow)s3.retry.num-retries(matches java)s3.retry.min-wait-ms(matches java)s3.retry.max-wait-ms(matches java)
There was a problem hiding this comment.
I think its reasonable to follow up with fine grained configurations and merging as is fine. I would request (and I can handle this if your good handing off ownership) that the fine grained configurations are done at the storage trait layer.
I recommended this PR for merging and inclusion into the 0.10 release being gathered here:
| // Transient errors are common for object stores; we retry temporary | ||
| // failures with exponential backoff. The retry behavior also | ||
| // benefits non-object-store backends. | ||
| let operator = operator.layer(TimeoutLayer::new()).layer(RetryLayer::new()); |
There was a problem hiding this comment.
I think its reasonable to follow up with fine grained configurations and merging as is fine. I would request (and I can handle this if your good handing off ownership) that the fine grained configurations are done at the storage trait layer.
I recommended this PR for merging and inclusion into the 0.10 release being gathered here:
Which issue does this PR close?
What changes are included in this PR?
iceberg-storage-opendal::OpenDalStorage::create_operatorcurrently wraps the built operator withRetryLayer::new(). That layer retries when its inner future returnsErr, which is what was needed for the fast-failing transient errors that #788 set out to fix — and it works correctly for that case.It does not bound futures that park indefinitely without producing an
Err. The canonical example is an S3 Range-GET whose underlying TCP connection is silently dropped (NAT/conntrack eviction, route flap, server-side disconnect with no RST received). The response future staysPendingforever;RetryLayerhas nothing to retry against because no error is ever produced.This PR adds
TimeoutLayer::new()insideRetryLayerto close that gap. Per opendal's docs:opendal's defaults (60 s for non-IO ops like
stat/list/delete, 10 s per IO chunk forread/write) are used; each retry attempt is now independently bounded, hung connections surface as a timeout error whichRetryLayerthen retries with backoff, and unrecoverable hangs propagate a clean error to the caller in seconds rather than the inner future parking forever.The diff is two lines in
crates/storage/opendal/src/lib.rs(the import and the layer composition) plus an updated comment explaining the ordering invariant.How we hit this
In production: a Rust application using
iceberg-storage-opendal::OpenDalStorageFactory::S3to read iceberg tables on AWS hung for 24 hours when icebergtry_next()returned aPendingfuture whose underlying opendal Range-GET against S3 never completed. Core-dump analysis showed:/proc/<pid>/net/tcphad only the OTel collector socket).Condvar::wait_until_internal, main thread inRuntime::block_on.So the response future was permanently
Pendingafter the TCP connection silently died, with no error to propagate. TheRetryLayerwas in the chain but dormant because there was no error to react to. AddingTimeoutLayerwould have produced a timeoutErrwithin seconds,RetryLayerwould have retried with backoff, and the operation would have surfaced cleanly within ~90 s instead of hanging until the pod'sactiveDeadlineSecondskilled it 24 h later.Context on the original composition
RetryLayer::new()was added in #788 (Dec 2024) to bound transient"connection closed before message completed"errors. That PR's description explicitly noted that configurability could be a follow-up. It correctly addressed the fast-failing transient case; the silent-hang case wasn't in scope. This PR extends the layer composition to also cover that second class.A user filed #1288 in May 2025 asking for IO-operation timeout support; it received no maintainer engagement and was auto-closed by the stale bot. This PR closes that issue with the minimal change: add a per-attempt bound so
RetryLayerhas a timeout error to retry against. opendal's docs explicitly document the ordering rule that applies when both layers are used together, which this PR follows.Are these changes tested?
Upstream CI on this PR (already running)
The full project CI ran on this draft and the codebase-internal tests pass — including the S3 integration suite against MinIO that I couldn't run locally:
Tests (default)— fullcargo nextest✅ pass (7m55s)Tests (doc)✅ passcheck_standalone(every crate builds in isolation) ✅ passcheck,buildon Linux + macOS ✅ passbuild_with_no_default_featureson Linux + macOS + Windows ✅ passSo healthy-path reads/writes/stats/deletes against MinIO still work with the new layer composition — opendal's defaults (60 s non-IO, 10 s per IO chunk) are not tight enough to false-positive on normal test traffic.
Local validation that the fix actually fixes the bug
The existing test suite doesn't have a "hung-connection" harness, so it can't directly validate the new behaviour. I wrote a small standalone reproducer that does: it spawns a TCP tarpit (accepts connections, never replies — exactly mimicking a silently-dropped TCP session), points opendal at it, and measures time-to-error under three layer stacks.
Result:
RetryLayerhas nothing to retry against when the inner future never produces anErr. This is the gap the PR closes.io timeout reachederror naming the timeout duration. Math checks out: 2 sio_timeout× 3 retry attempts + default exponential backoff between attempts (≈ 1+2+4 s) ≈ 13–15 s.This is reproducible in ~5 minutes by anyone with a Rust toolchain. Happy to upstream this as an integration test in
crates/storage/opendal/tests/if maintainers want — the harness is ~70 lines and depends only onopendal+tokio+anyhow.Click to expand: full tarpit harness source
Cargo.toml:src/main.rs:Notes
TimeoutLayerapplies uniformly. For in-memory/fs backends the timeout is effectively never hit; the cost is negligible. For network backends it is the actual fix.iceberg::Storagethemselves and inject their preferred layer stack — but the default path should not hang silently, which is what this PR addresses. Happy to expand into a configurable form (e.g. accept aTimeoutLayeron theOpenDalStorageFactory::S3variant) in review if maintainers prefer.